Skip to content

feat(index): support configurable multi-segment FM-Index builds#7123

Merged
jackye1995 merged 12 commits into
lance-format:mainfrom
beinan:beinan/fmindex-segmented-index
Jun 10, 2026
Merged

feat(index): support configurable multi-segment FM-Index builds#7123
jackye1995 merged 12 commits into
lance-format:mainfrom
beinan:beinan/fmindex-segmented-index

Conversation

@beinan

@beinan beinan commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add num_segments parameter to FM-Index creation that distributes dataset fragments across multiple independent index segments
  • Each segment is a complete FM-Index covering a disjoint set of fragments, enabling incremental indexing and segment merge
  • Wire FM-Index into Lance's segmented index infrastructure (commit_existing_index_segments, merge_existing_index_segments)
  • Add FMINDEX type mapping and num_segments kwarg in Python bindings

Usage

# Single segment (default, backward compatible)
ds.create_index("text_col", index_type="FMIndex")

# Multi-segment: fragments distributed across 4 segments
ds.create_index("text_col", index_type="FMIndex", num_segments=4)

Test plan

  • Existing 20 FM-Index unit tests pass (verified locally)
  • Build compiles clean (cargo check -p lance -p lance-index)
  • cargo fmt passes
  • CI passes
  • Test multi-segment build with real dataset

🤖 Generated with Claude Code

@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Important

This PR touches the Lance format specification.

Substantive changes to the format specification — the .proto definitions
and the spec docs under docs/src/format/ — require a PMC vote before merge.
Minor edits such as typo fixes, wording, or formatting are excluded; use your
judgment.

If this is a meaningful format change:

  • Start a vote following the Lance community voting process.
    Format specification modifications need 3 binding +1 votes (excluding the
    proposer), held on GitHub Discussions, with a minimum voting period of 1 week.
  • Once the vote passes, link the completed vote in this PR. It should not be
    merged until the vote is linked.

@github-actions github-actions Bot added A-python Python bindings A-index Vector index, linalg, tokenizer A-format On-disk format: protos and format spec docs and removed A-python Python bindings A-index Vector index, linalg, tokenizer A-format On-disk format: protos and format spec docs labels Jun 5, 2026
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@beinan beinan changed the title Support configurable multi-segment FM-Index builds feat(index): support configurable multi-segment FM-Index builds Jun 5, 2026
@github-actions github-actions Bot added enhancement New feature or request A-python Python bindings A-index Vector index, linalg, tokenizer A-format On-disk format: protos and format spec docs labels Jun 5, 2026
@beinan beinan marked this pull request as ready for review June 5, 2026 07:41

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@codecov

codecov Bot commented Jun 5, 2026

Copy link
Copy Markdown

@beinan beinan force-pushed the beinan/fmindex-segmented-index branch from 5822f75 to 40bec4d Compare June 5, 2026 19:46

@jackye1995 jackye1995 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found a correctness issue in segmented FMIndex search and a few behavioral divergences from the existing segmented-index pattern. Details inline.

Comment thread rust/lance/src/index/scalar_logical.rs Outdated
),
};
let match_count = row_addrs.true_rows().row_addrs().unwrap().count();
assert!(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion is too weak and appears to hide a correctness bug in segmented FMIndex search. FMIndex training currently does not request _rowid or _rowaddr, so collect_texts falls back to a local counter starting at 0 for each segment, and search returns those ids as row addresses. With this fixture, "quick" should match 5 rows, but segmented search can collapse local ids across segments and still satisfy >= 3.

Can we fix FMIndex to store global row addresses / stable row ids and make this test assert the exact expected rows or count?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I fixed this by:

  1. Updating FMIndexPlugin to request global row addresses (.with_row_addr()) during training.
  2. Updating collect_texts to parse the global _rowaddr (falling back to _rowid if needed), rather than falling back to local counter starting at 0.
  3. Adding a new search_row_addrs method on FMIndex to return the full 64-bit row address during logical searches, avoiding any 32-bit offset truncation.
  4. Tightening the test assertion to assert exactly 5 matching rows.

Comment thread rust/lance/src/index/create.rs Outdated
}

self.dataset
.commit_existing_index_segments(&index_name, &column, segment_metadatas)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit path does not preserve the normal replace=true behavior. The standard execute path removes all existing same-name indices when replace is set, but commit_existing_index_segments only removes overlapping same-type segments. Replacing an existing same-name BTree/bitmap/etc. index with segmented FMIndex can leave mixed metadata under one index name, and empty/untrained FMIndex coverage can leave old same-type segments too.

Can this path either build the transaction with the same removed_indices as the normal path, or extend the segment commit API with explicit replace semantics?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. I replaced the commit_existing_index_segments calls with explicit TransactionBuilder + apply_commit, collecting all same-name indices into removed_indices when replace=true. This matches the standard execute path behavior and ensures replacing a BTree/bitmap index with a segmented FMIndex properly cleans up all old metadata.


/// Build FM-Index with multiple segments, each covering a subset of fragments.
async fn execute_multi_segment_fmindex(&mut self, num_segments: u32) -> Result<IndexMetadata> {
let column_input = &self.columns[0];

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bypasses the standard execute_uncommitted validation that requires exactly one column. For num_segments > 1, an empty column list will panic here, and a multi-column request silently indexes only the first column instead of returning the existing error.

Can we run the same self.columns.len() != 1 validation before entering the segmented FMIndex path?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. execute_multi_segment_fmindex now runs the same self.columns.len() != 1 check at entry, matching execute_uncommitted validation.


let mut fragment_bitmap = RoaringBitmap::new();
for segment in &segments {
fragment_bitmap |= segment.fragment_bitmap.as_ref().cloned().ok_or_else(|| {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This merge path should mirror the btree merge behavior and intersect stored segment coverage with the dataset's current live fragments. As written, it unions historical fragment ids and passes them to build_scalar_index; after compaction retires one of those fragments, the rebuild can fail with No fragment with id ....

Can we use effective live coverage here, and add a compaction/retired-fragment merge test similar to the btree coverage?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. merge_segments now intersects the unioned fragment bitmap with dataset.fragment_bitmap to drop retired/compacted fragments before rebuilding, mirroring the btree merge behavior. When the intersection is empty (all fragments retired), it produces an empty index instead of crashing. Added test_fmindex_merge_after_compaction_drops_retired_fragments to cover this scenario.

beinan pushed a commit to beinan/lance that referenced this pull request Jun 9, 2026
… FM-Index builds

- Fix local row address bug in collect_texts by requesting _rowaddr and returning u64 row addresses in search_row_addrs
- Rebuild segmented index commit path in create.rs using TransactionBuilder to properly handle replace=true with same-name index removal
- Run single-column validation at the entry of execute_multi_segment_fmindex
- Intersect segment fragment coverage with live dataset fragments in merge_segments to prevent failures after compaction retires fragments
- Add test_fmindex_merge_after_compaction_drops_retired_fragments integration test

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
let value_col = batch.column(0);
// Prefer _rowaddr (global row address) over _rowid to ensure stable,
// globally unique identifiers across segments.
let row_addrs: &arrow_array::UInt64Array = batch

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] This new hard requirement for _rowaddr also needs to be reflected in FMIndex update/optimize paths. FMIndexScalarIndex::update_criteria() still returns TrainingCriteria::new(TrainingOrdering::None), so optimize_indices() / append maintenance can build a training stream with only the value column and fail here. Also, FMIndexScalarIndex::update() currently writes a fresh index from new_data only and ignores the existing index plus old_data_filter; if we only add .with_row_addr(), the generic single-segment scalar update path can replace the old FMIndex with one containing only appended rows. Can we either make FMIndex update merge existing rows correctly, or force FMIndex maintenance to rebuild from the full target fragment bitmap instead of taking the single-segment update() shortcut?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed both issues:

  1. update_criteria() now returns requires_old_data with .with_row_addr(), so the training stream includes all existing + new rows with global row addresses.
  2. update() now applies the old_data_filter to exclude deleted/compacted rows before rebuilding the BWT, so the single-segment update path produces a complete index covering both old and new data instead of silently dropping existing rows.

Comment thread rust/lance-index/src/scalar/fmindex.rs Outdated
let texts = collect_texts(new_data).await?;
let mut texts = collect_texts(new_data).await?;
if let Some(filter) = old_data_filter {
texts.retain(|(row_addr, _)| match &filter {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] This filter drops the appended rows in the optimize/append path. Because update_criteria() now uses requires_old_data, load_unindexed_training_data() passes fragments=None and scans all current rows, not just old rows. But the old_data_filter here only describes the selected old segment coverage, so retaining only to_keep / valid removes the unindexed appended fragments from texts before rebuilding. I think FMIndex either needs to rebuild from the intended union fragment set without this old-only filter, or the update path needs a filter that keeps both selected old rows and new/unindexed rows while excluding deleted/retired old rows.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. The filter now excludes to_remove fragments instead of keeping only to_keep, so both old valid rows and new unindexed/appended rows are retained in the rebuilt index.

@github-actions github-actions Bot added the A-java Java bindings + JNI label Jun 9, 2026
Beinan Wang added 8 commits June 9, 2026 17:22
Add num_segments parameter to FM-Index creation that distributes dataset
fragments across multiple independent index segments. Each segment is a
complete FM-Index covering a disjoint subset of fragments, enabling
incremental indexing and segment merge support.

- Add num_segments field to FMIndexIndexDetails proto
- Add multi-segment build path in CreateIndexBuilder that splits
  fragments into groups, builds one segment per group, and commits
  atomically via commit_existing_index_segments
- Add segment_has_fmindex_details predicate and FM-Index branch in
  merge_existing_index_segments dispatch (merge = rebuild from source)
- Add dataset-level fmindex::merge_segments function
- Add FMINDEX type mapping and num_segments kwarg in Python bindings

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
Segment topology is managed by the manifest (fragment_bitmap on
IndexMetadata), not by index-type-specific protos. num_segments
flows through ScalarIndexParams JSON at build time only.

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
- Keep FMIndexIndexDetails empty (segment topology is managed by manifest)
- Collapse nested if-let in dataset.rs to satisfy clippy collapsible_if

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
- test_fmindex_segments_commit_and_query_as_logical_index: builds one
  segment per fragment, commits, queries across segments
- test_fmindex_segments_merge_and_query: builds two segments, merges
  into one, verifies query results on merged index
- Fix segment_has_fmindex_details predicate (FMIndexIndexDetails
  not FmIndexIndexDetails)

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
…-segment path

- Enforce existing-index name collision check (replace=false now errors)
- Honor train=false by building an empty index instead of scanning
- Handle empty datasets (zero fragments) without panicking

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
… FM-Index builds

- Fix local row address bug in collect_texts by requesting _rowaddr and returning u64 row addresses in search_row_addrs
- Rebuild segmented index commit path in create.rs using TransactionBuilder to properly handle replace=true with same-name index removal
- Run single-column validation at the entry of execute_multi_segment_fmindex
- Intersect segment fragment coverage with live dataset fragments in merge_segments to prevent failures after compaction retires fragments
- Add test_fmindex_merge_after_compaction_drops_retired_fragments integration test

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
Beinan Wang and others added 4 commits June 9, 2026 17:23
…ew data

- Change update_criteria() to requires_old_data with .with_row_addr()
  so optimize_indices/append maintenance streams include both existing
  and new rows with global row addresses
- Apply old_data_filter in update() to exclude deleted/compacted rows
  before rebuilding the BWT, preventing the single-segment update
  shortcut from silently dropping existing indexed rows

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
…FMIndex update

The old_data_filter.to_keep only covers old segment fragments, so
filtering by it dropped all new/appended rows. Use to_remove exclusion
instead so both old valid rows and new unindexed rows are retained.

Co-Authored-By: Beinan Wang <beinanwang@microsoft.com>
@jackye1995 jackye1995 force-pushed the beinan/fmindex-segmented-index branch from 16f611d to eba411d Compare June 10, 2026 00:43
@jackye1995 jackye1995 merged commit 70f5695 into lance-format:main Jun 10, 2026
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-format On-disk format: protos and format spec docs A-index Vector index, linalg, tokenizer A-java Java bindings + JNI A-python Python bindings enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants